Partitioned Primary Index

Teradata provides additional indexing options that give even more flexibility when implementing a Teradata database. One of these options is the Partitioned Primary Index (PPI). Primary indexes can be partitioned or non-partitioned.
  • A non-partitioned primary index (NPPI) is the traditional primary index by which rows are assigned to AMPs. The idea behind the Teradata Partitioned Primary Index is to keep the records of a table packed together in clearly delineated containers, of which there are not too many in total relative to the size of the table.
  • In that sense, Partitioning is just a different way of structuring the data records on the disks. In the case of partitioned tables (PPI Tables), the Primary Index still determines the responsible AMP of each row. 
  • Only the way the rows are stored in the data blocks is different: while non-partitioned tables store their rows sorted by rowid only, partitioned tables (PPI tables) store rows first inside the corresponding partition and only afterward sort them by rowid.
  • A partitioned primary index (PPI) permits rows to be assigned to user-defined data partitions on the AMPs, enabling enhanced performance for range queries that are predicated on primary index values. The Partitioned Primary Index (PPI) feature allows a class of queries to access a portion of a large table, instead of the whole table.
  • The traditional uses of the Primary Index (PI) for data placement and rapid access of the data when the PI values are specified are retained. Some common business queries generally require a full-table scan of a large table, even though it’s predictable that a fairly small percentage of the rows will qualify.
  • Partitioning will not help to avoid table scans, but it will reduce the number of data blocks that each AMP must move to its local memory (FSG cache) because only relevant partitions need to be accessed.
  • Whenever a data record is accessed, the cylinder index of the AMP is queried to find out on which cylinder the first data block of the accessed partition is located. After positioning at the first data block, all remaining data blocks of the partition (or subpartitions in the case of multilevel partitioning) can be fetched into the FSG cache by scanning in sequence with a minimum number of disk I/Os.
  • One example of such a query is a trend analysis application that compares current month sales to the previous month, or to the same month of the previous year, using a table with several years of sales detail. Another example is an application that compares customer behavior in one (fairly small) geographic region to another region.
  • The PRIMARY INDEX clause (part of the CREATE TABLE statement) has been extended to include a PARTITION BY clause. This new partition expression definition is the only thing that needs to be done to create a partitioned table.
     
 Advantages of this approach are:
  • No separate partition layout.
  • No disk layout for partitions.
  • No definition of location in the system for partition.
  • No need to define/manage separate tables per segment of the table that needs to be accessed.
  • Even data distribution and even processing of a logical partition is automatic due to the PI distribution of the rows.
A table with a non-partitioned primary index (NPPI) uses the traditional primary index by which rows are assigned to AMPs. Apart from maintaining their storage in row hash order, no additional assignment processing of rows is performed once they are hashed to an AMP. With an NPPI table, the PARSER will include Partition Number 0 in the request. For a table with an NPPI, all of the rows are assumed to be part of one partition (Partition 0).

Assuming that an SQL statement (e.g., SELECT) provides equality value(s) to the column(s) of a Primary Index, the TD Database software retrieves the row or rows from a single AMP as described below.

The Parsing Engine (PE) creates a four-part message composed of the Table ID, Partition #0, the Row Hash, and Primary Index value(s).

The 48-bit Table ID is located via the Data Dictionary, the 32-bit Row Hash value is generated by the Hashing Algorithm, and the Primary Index value(s) come from the SQL request. The Parsing Engine (via the Data Dictionary) knows if a table has an NPPI and sets the Partition Number to 0.

The Message Passing Layer uses a portion of the Row Hash to determine to which AMP to send the request. The Message Passing Layer uses the HBN portion of the Row Hash (first 16 or 20 bits of the Row Hash) to locate a bucket in the Hash Map(s). This bucket identifies to which AMP the PE will send the request. The Hash Maps are part of the Message Passing Layer interface.

The AMP uses the Table ID and Row Hash to identify and locate the proper data block, then uses the Row Hash and PI value to locate the specific row(s). The PI value is required to distinguish between Hash Synonyms. The AMP implicitly assumes the rows are part of partition #0.

Note: The Partition Number (effectively 0) is not stored within the data rows for a table with a NPPI. The FLAG or SPARE byte (within the row overhead) has a bit set to zero for a NPPI row and it is set to one for a PPI row.
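
If you want to see this hashing in action, Teradata's hash functions can be queried directly. The following sketch uses the emp table defined in the next section (any table and its primary index column would do):

SELECT emp_id,
       HASHROW(emp_id) AS row_hash,                        -- 32-bit row hash
       HASHBUCKET(HASHROW(emp_id)) AS hash_bucket,         -- bucket looked up in the hash map
       HASHAMP(HASHBUCKET(HASHROW(emp_id))) AS owning_amp  -- AMP that owns this bucket
FROM emp;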
Acronyms:
HBN – Hash Bucket Number
PPI – Partitioned Primary Index
NPPI – Non-Partitioned Primary Index

How to Define PPI
create table emp
(emp_id integer not null,
dept_no integer not null,
Fname Varchar(20),
Lname Varchar(20),
Salary decimal(10,2)
) Primary Index (emp_id)
partition by (dept_no);

create table orders
(
order_id integer not null,
cust_id integer not null,
order_date date,
order_amount integer
)
Primary Index (cust_id)
partition by case_n
(
order_amount < 10000,
order_amount < 20000,
order_amount < 30000,
no case or unknown
);


Partition by RANGE

CREATE TABLE ORDER_Table
(
ORDER_ID integer NOT NULL,
CUST_ID integer NOT NULL,
ORDER_DATE date,
ORDER_AMOUNT integer
)
PRIMARY INDEX (CUST_ID)
PARTITION BY RANGE_N (
ORDER_DATE BETWEEN DATE '2012-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' MONTH,
NO RANGE OR UNKNOWN
);

If we use NO RANGE (or NO CASE for CASE_N), then all values that do not fall into any defined range or case will be placed in a single catch-all partition.

If we specify UNKNOWN, then all rows whose partitioning expression evaluates to NULL will be placed in this partition.
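
A quick way to verify where rows land is the system-derived PARTITION column. The sketch below (the inserted values are purely illustrative) uses the ORDER_Table defined above; since that table combines NO RANGE OR UNKNOWN into one partition, out-of-range dates and NULL dates end up in the same catch-all partition, whereas declaring NO RANGE and UNKNOWN separately keeps them apart:

INSERT INTO ORDER_Table VALUES (1, 100, DATE '2012-03-15', 500);  -- falls into the March 2012 range
INSERT INTO ORDER_Table VALUES (2, 100, DATE '2011-12-31', 750);  -- outside all ranges: NO RANGE OR UNKNOWN partition
INSERT INTO ORDER_Table VALUES (3, 100, NULL, 900);               -- NULL date: same combined partition

SELECT PARTITION, ORDER_ID, ORDER_DATE
FROM ORDER_Table
ORDER BY 1;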


It is a remarkable technical detail that Teradata internally considers all tables to be partitioned. NPPI tables are nothing other than PPI tables with exactly one partition, namely partition number zero, containing all table rows. You should always collect statistics on the dummy column PARTITION, even for non-partitioned tables, as the optimizer uses them to estimate the table cardinality of NPPI tables.
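
Collecting these statistics is a one-liner (the table name is just a placeholder):

COLLECT STATISTICS COLUMN (PARTITION) ON MyTable;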

The process of restricting access to only the relevant chunks of data along the partitioning attributes is often called partition elimination.

You may wonder in which way partitioning is different from indexing techniques on Teradata.

Maybe the most notable difference is that all Teradata index types consume permanent space. Any Teradata index is a sub-table redundantly storing a subset of columns of its related base table.

While we still experience an impact on the occupied permanent disk space, this is not related to the redundant storage of columns but caused by the overhead of storing the records in a partitioned way: PPI table rows are 2 to 8 bytes wider than the equivalent non-partitioned table rows. The extra bytes are used to store the internal partition number of each row.

Traditionally, one of the primary applications for a PPI is partitioning by a date. Most reports have a date dimension as part of their reporting filters. Consider a call detail table in a telecommunication company: these are huge tables holding detailed information about each single call for a particular time frame. Analytics often take place directly on this one big table, and joins to other tables are frequently not required to produce the query result.

Imagine a call detail table containing a full year of data on a daily level, but you only need one day in your report result dataset. If you partition this table by date, each AMP can restrict access exactly to this one date, i.e. only 1/365 of the rows have to be moved from the disks into the AMP's memory (FSG cache).
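
A sketch of such a setup (all names are hypothetical): the equality condition on the partitioning column lets each AMP read only the single daily partition instead of the whole table.

CREATE TABLE call_detail
(
call_id BIGINT NOT NULL,
subscriber_id INTEGER NOT NULL,
call_date DATE NOT NULL,
duration_sec INTEGER
)
PRIMARY INDEX (subscriber_id)
PARTITION BY RANGE_N (call_date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31' EACH INTERVAL '1' DAY);

-- only the partition for this one day is scanned on each AMP
SELECT COUNT(*), SUM(duration_sec)
FROM call_detail
WHERE call_date = DATE '2023-06-15';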

Partitioning should always be considered over the use of other index types if ranges of data records have to be accessed together. Typical workloads are aggregations (“sum up all account balances for my business segment” etc.) which you often find in reporting requirements.
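
For example, a monthly aggregation against the call_detail sketch above only has to touch the roughly 30 daily partitions in the range rather than the full year of data:

SELECT subscriber_id, SUM(duration_sec) AS total_duration
FROM call_detail
WHERE call_date BETWEEN DATE '2023-06-01' AND DATE '2023-06-30'
GROUP BY subscriber_id;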

On the other hand, certain indexing techniques that allow direct access to the data blocks (primary index, unique secondary index, join index with matching primary index definition, etc.) are usually the right choice for OLTP applications (“Give me the customer name for the client where customer id = 123”).

As you can combine partitioned tables with other index types, you can improve usability even further.

There are three kinds of partitioning supported: simple partitioning by a column expression, the CASE_N syntax, and the RANGE_N syntax.

As the names already indicate, RANGE_N allows you to create ranges of rows which will end up in the same partition, and CASE_N allows you to use a simple CASE … WHEN style expression.

Here is one example of an often used RANGE_N partitioning by date:

CREATE TABLE <TABLE>
(
PK INTEGER, MyDate DATE)
PRIMARY INDEX (MyDate)
PARTITION BY RANGE_N (MyDate BETWEEN DATE '2014-01-01' AND DATE '2020-12-31' EACH INTERVAL '1' DAY, NO RANGE, UNKNOWN);

As you can see from the above example, there are two special partitions which you can add to each partitioned table:

All rows not matching any of the defined ranges will end up in the NO RANGE bucket. If you do not specify a NO RANGE partition and try to insert records not matching any defined range, you will receive an error message, and the insert will fail.

The UNKNOWN partition is used to hold rows with NULL values in the partition expression.

Here is one example for a CASE_N partitioning:

CREATE TABLE <TABLE>
(
PK INTEGER, MyValue INTEGER)
PRIMARY INDEX (PK)
PARTITION BY CASE_N (MyValue < 1000, MyValue < 2000, MyValue < 3000, NO CASE, UNKNOWN);

Again, we can add the two special partitions as required.

CASE_N partitioning should be utilized if you want to group values together into partitions.

Partitioning is not restricted to one level. Multilevel Partitioning allows the partitioning of tables on more than one level. Here is one example:

CREATE TABLE <TABLE>
(
PK INTEGER, MyValue INTEGER, MyDate DATE)
PRIMARY INDEX (PK)
PARTITION BY (
CASE_N (MyValue < 1000, MyValue < 2000, MyValue < 3000, NO CASE, UNKNOWN),
RANGE_N (MyDate BETWEEN DATE '2014-01-01' AND DATE '2020-12-31' EACH INTERVAL '1' DAY, NO RANGE, UNKNOWN));

Just keep in mind that the number of partitions is limited. Up to release 13.10 of Teradata, the maximum number of partitions was 65,535. Pay attention to the fact that this is the number of combined partitions in the case of multi-level partitioning, and don't forget to count the NO RANGE/NO CASE and UNKNOWN partitions as well. You have to multiply the number of partitions on each level to get the overall number of partitions in use. Starting with Teradata 14.10 it is possible to have more than 9 quintillion partitions.
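
As a worked example for the multilevel table above: level one has 5 partitions (3 CASE_N cases plus NO CASE and UNKNOWN), and level two has 2,559 partitions (2,557 daily ranges between 2014-01-01 and 2020-12-31 plus NO RANGE and UNKNOWN), giving 5 x 2,559 = 12,795 combined partitions, comfortably below the 65,535 limit of the older releases.
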
When does Teradata partition elimination take place?

Many misconceptions exist about when partition elimination takes place.

First, you have to distinguish between partition elimination and primary index access. Partition elimination occurs independently of any involvement of the Primary Index in a query.

Although the primary index choice plays a significant role in partitioning decisions, most of the time it is the elimination of partitions that accounts for the bulk of the IO reduction.

These are the possible performance patterns for partitioned primary index access, depending on how the PI and the partitioning column(s) are referenced in the query:

  • PI value and partitioning column(s) specified: single-AMP access to a single partition.
  • PI value specified, partitioning column(s) not specified: single-AMP access, but all partitions on that AMP have to be probed for the row hash.
  • Partitioning column(s) specified, PI value not specified: all-AMP access restricted to the qualifying partitions (partition elimination).
  • Neither specified: all-AMP full-table scan.

Furthermore, for multi-level partitioning, you do not have to reference all partitioning levels in the WHERE condition to benefit from partition elimination. Each of the levels can be addressed independently, which is indeed a very nice feature in Teradata.
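
For instance, against the multilevel table sketched earlier (using MyTable as a stand-in for the <TABLE> placeholder), a filter on the date column alone is enough for elimination on the RANGE_N level, even though the CASE_N level is not referenced:

SELECT *
FROM MyTable
WHERE MyDate BETWEEN DATE '2015-03-01' AND DATE '2015-03-31';  -- eliminates partitions on the RANGE_N level only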

Now for the limitations of the primary index choice in the context of partitioning.

Many of us have heard about a performance difference between tables that include all partitioning columns in the primary index and tables that do not. Also, if the primary index does not include the partitioning columns, it cannot be defined as unique. What’s behind all this?

To understand the relation between primary index uniqueness and partitioning, we have to recall how rows are stored on the disks:

In the first step, the rows are hashed to the responsible AMPs. In the second phase, they are inserted into the proper partitions and sorted by the ROWID.

Imagine at first a table with a non-unique primary index: Many rows can have the same primary index and will hash to the same AMP. However, each of these rows could belong to a different partition.

In the case of a non-partitioned table, all rows with the same row hash are stored together; in the case of a partitioned table, they may be scattered across different partitions.

Now imagine it was allowed to have a partitioned table with a unique primary index and without the partition column being part of the primary index. Any update or insert statement would need Teradata to check each partition to avoid the creation of duplicates. Having to check each partition is very inefficient from a performance point of view.

Also, keep in mind that if the primary index does not include the partitioning columns, then each time a primary index access is required, the responsible AMP has to probe all its partitions for this particular row hash. The scan of all partitions can be avoided if you include the partitioning columns in the primary index. I consider this a minor problem, more related to OLTP.
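
A sketch of the two variants (table and column names are hypothetical); only the second table, whose primary index contains the partitioning column, may be declared unique:

-- NUPI: partitioning column not part of the PI, so the PI cannot be unique
CREATE TABLE sales_nupi
(sale_id INTEGER NOT NULL, sale_date DATE NOT NULL, amount DECIMAL(10,2))
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N (sale_date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31' EACH INTERVAL '1' MONTH);

-- UPI: partitioning column included in the PI, so uniqueness can be enforced
CREATE TABLE sales_upi
(sale_id INTEGER NOT NULL, sale_date DATE NOT NULL, amount DECIMAL(10,2))
UNIQUE PRIMARY INDEX (sale_id, sale_date)
PARTITION BY RANGE_N (sale_date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31' EACH INTERVAL '1' MONTH);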

Table partitioning has other impacts on performance as well, especially when joining a non-partitioned and a partitioned table, because Teradata has to apply different join techniques than for two non-partitioned tables. The sliding window merge join is one of the join techniques related to partitioning. Another option is to “de-partition” the partitioned table and do a regular merge join, or to partition the table which is not partitioned.

As with joins between two non-partitioned tables, from a performance point of view it is best to have the same primary index on both tables and to join on all primary index columns and, in addition, on the partitioning columns. Anything else requires less efficient join techniques.

Hence, while the elimination of partitions may significantly reduce disk I/Os, you have to keep an eye on its join performance impact.

Some more benefits arise when using partitioned tables: the records of complete partitions can be removed from the table with a simple statement and, more importantly, without involving the transaction log:

ALTER TABLE <TABLE> MODIFY PRIMARY INDEX
DROP RANGE BETWEEN DATE '2014-01-01' AND DATE '2014-06-30' EACH INTERVAL '1' DAY
WITH DELETE;

Finally, here are the “best practice” statistics recommendations for partitioned tables:

  • The table's primary index (PI) column(s)
  • The partitioning column(s)
  • Columns in any non-partitioned table that are equated to the partitioning column of the partitioned table
  • The system-derived PARTITION dummy column of all partitioned tables
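
As a sketch, for the call_detail table from the earlier example this translates into:

COLLECT STATISTICS COLUMN (subscriber_id) ON call_detail;  -- table PI
COLLECT STATISTICS COLUMN (call_date) ON call_detail;      -- partitioning column
COLLECT STATISTICS COLUMN (PARTITION) ON call_detail;      -- system-derived PARTITION column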
